In [2]:
%pylab inline
%qtconsole
In [3]:
import sklearn
from sklearn import cluster,datasets
In [7]:
N=10**4
D_gen=2
K_gen=10
In [27]:
X,A_true=sklearn.datasets.make_blobs(n_samples=N, n_features=D_gen, centers=K_gen, cluster_std=1.0, center_box=(-100.0, 100.0))
In [12]:
K=10
N_inits=500
max_iter=20
In [17]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=True, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=1)
The documentation claims that precompute_distances (True by default) consumes more memory but makes the algorithm faster. For our goal (running K-Means many times with a small number of iterations) this is not the case, as the timings below show. The documentation does not explain what this parameter actually does under the hood.
In [18]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=False, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=1)
The K-Means function from scikit-learn already supports CPU parallelization through a single parameter. As expected, using multiple cores resulted in a speedup, albeit a small one. The parallelization is applied to the pairwise distance computation: the pairwise distance matrix is split into chunks according to the number of jobs selected, as sketched below.
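The sketch below only illustrates that chunking idea and is not scikit-learn's internal code; the chunked_distances helper, the chunk size, and the use of joblib are assumptions made for the illustration.
# Illustrative sketch: split the point-to-centroid distance computation
# across jobs by chunking the rows of X.
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

def chunked_distances(X, centers, n_jobs=2, chunk_size=2500):
    chunks = [X[i:i + chunk_size] for i in range(0, X.shape[0], chunk_size)]
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(euclidean_distances)(chunk, centers) for chunk in chunks)
    return np.vstack(blocks)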
In [21]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=True, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=-1)
In [24]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=False, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=-1)
It should be noted that this implementation appears to be accelerated by compiled C code.
These are the times from the code above before installing the MKL-optimized scikit-learn stack (from the Anaconda Accelerate add-on).
1 loops, best of 3: 5.67 s per loop
1 loops, best of 3: 4.05 s per loop
1 loops, best of 3: 3.29 s per loop
1 loops, best of 3: 2.57 s per loop
In [2]:
import numbapro
from numbapro import jit, cuda, int32, float32, float64
#@jit(complex64(int32, float32, complex64), target="cpu")
In [3]:
numbapro.check_cuda()
Out[3]:
In [5]:
import math

# CPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'], target='cpu')
def cpu_sincos(x, y):
    return math.sin(x) * math.cos(y)

# GPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'], target='gpu')
def gpu_sincos(x, y):
    return math.sin(x) * math.cos(y)
In [6]:
n=10**6
x = np.linspace(0,np.pi,n)
y = np.linspace(0,np.pi,n)
print 'Data of type ',x.dtype
print 'Unoptimized\t',
%timeit np.sin(x) * np.cos(y)
print 'CPU vectorize\t',
%timeit cpu_sincos(x,y)
print 'GPU vectorize\t',
%timeit gpu_sincos(x,y)
del x,y
The unoptimized and CPU-vectorized times are similar because the sin() and cos() calls dominate the runtime. The GPU is already considerably faster and would be faster still on a better GPU (the current GPU has only 48 CUDA cores).
All the data has to travel from RAM to GPU memory and back again. The work is distributed across all the threads on the GPU, and scheduling and memory management are taken care of automatically.
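If those transfers dominate, the operands can be kept on the GPU between calls, using the same to_device / copy_to_host pattern that the guvectorize and @cuda.jit cells below rely on. Whether this version of numbapro lets the vectorized GPU ufunc consume device arrays directly is an assumption here (current numba does), so treat it as a sketch.
# Sketch: pay the host<->device transfer once instead of on every call.
x = np.linspace(0, np.pi, 10**6)
y = np.linspace(0, np.pi, 10**6)
dx = numbapro.cuda.to_device(x)   # host -> device, once
dy = numbapro.cuda.to_device(y)
dres = gpu_sincos(dx, dy)         # assumed to return a device array
res = dres.copy_to_host()         # device -> host, once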
In [39]:
@numbapro.vectorize(['float32(float32,float32,float32,float32)'])
def cpu_powers(p, q, r, s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)

@numbapro.vectorize(['float32(float32,float32,float32,float32)'], target='gpu')
def gpu_powers(p, q, r, s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)
When target is not specified, it defaults to 'cpu'.
In [40]:
## Benchmarking
#Generate data
n=5e6
p = np.random.random(n).astype(np.float32)
q = np.random.random(n).astype(np.float32)
r = np.random.random(n).astype(np.float32)
s = np.random.random(n).astype(np.float32)
print 'Unoptimized\t',
%timeit np.sqrt(p**2 + q**3 + r**4 + s**5)
print 'CPU vectorize\t',
%timeit cpu_powers(p,q,r,s)
print 'GPU vectorize\t',
%timeit gpu_powers(p,q,r,s)
del p,q,r,s
Anaconda Accelerate adds MKL optimizations to several libraries. It remains to be checked whether the MKL-optimized NumPy routines are actually faster than the standard ones.
The speedup of the GPU over the CPU is $$2.1\ \mathrm{s} / 69.4\ \mathrm{ms} = 30.26$$
We've now seen that @vectorize accelerates functions, especially on the GPU, but it only works on scalar functions. Even though more complex operations like matrix multiplication can be decomposed into scalar operations over their elements, it is more convenient to operate on arrays directly, which is what @guvectorize allows; a minimal CPU-target sketch follows, and the next cell uses it on the GPU.
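As context for the next cell, here is a minimal sketch of a CPU-target gufunc using the same decorator and layout-signature syntax; vec_add is a made-up example, not code from the video.
# Minimal gufunc sketch: element-wise addition of two 1-D vectors.
# The layout '(n),(n)->(n)' says the inputs and the output share length n;
# any extra leading dimensions of the actual arguments are broadcast over.
@numbapro.guvectorize(['void(float32[:], float32[:], float32[:])'],
                      '(n),(n)->(n)', target='cpu')
def vec_add(a, b, c):
    for i in range(a.shape[0]):
        c[i] = a[i] + b[i]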
In [7]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(n,p)->(m,p)', target='gpu')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                tmp += a[i, n] * b[n, j]
            c[i, j] = tmp
#batch_matrix_mult.max_blocksize = 32 # limits to 32 threads per block
This is the code from the video. It looks like a plain matrix multiplication, but in the video they call it a batch multiplication: the layout signature only describes the trailing matrix dimensions, so the gufunc broadcasts over any extra leading (batch) dimension of the inputs and multiplies each pair of matrices in the batch.
In [9]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
print 'Exception when too many GPU resources are used:'
try:
    batch_matrix_mult(a,b)
except Exception as exception:
    print exception
batch_matrix_mult.max_blocksize = 64 # limit to 64 threads per block
import numpy.core.umath_tests as ut
print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)
print 'GUVectorize GPU\t',
%timeit c2 = batch_matrix_mult(a,b)
del n,dim,a,b
In [23]:
del n,dim,a,b
From the documentation:
There are times when the gufunc kernel uses too many of a GPU’s resources, which can cause the kernel launch to fail. The user can explicitly control the maximum size of the thread block by setting the max_blocksize attribute on the compiled gufunc object.
To control the size of the thread block we use (e.g. for 32 threads per block):
very_complex_kernel.max_blocksize = 32
The resulting speedup can be seen from the timings in the next cell.
In [11]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)
def check_pure_compute_time(da, db, dc):
    batch_matrix_mult(da, db, out=dc)
    numbapro.cuda.synchronize()
import numpy.core.umath_tests as ut
print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)
print 'Memory management'
batch_matrix_mult.max_blocksize = 64 # limit to 64 threads per block
print 'Automatic:\t\t',
%timeit c=batch_matrix_mult(a,b)
print 'Manual:\t\t\t',
%timeit check_pure_compute_time(da,db,dc)
del n,dim,a,b,da,db,dc
Kernel function
A grid is a group of blocks (which can be 1D, 2D or 3D), and each block is a group of threads (which can also be 1D, 2D or 3D). Every thread executes the same kernel function.
When we use @cuda.jit we are writing functions under the CUDA execution model. The functions we wrote for the GPU earlier had a return value, which is not allowed here; that is because they used @vectorize and did not work directly under the CUDA execution model. A minimal sketch of such a kernel follows.
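As a minimal illustration of that model (scale_by_two is a made-up example): the kernel returns nothing and writes through its array argument, each thread uses its global index to pick one element, and the launch configuration kernel[blocks, threads](...) decides how many threads run.
# Sketch of a minimal CUDA kernel under this execution model.
@numbapro.cuda.jit("void(float32[:])")
def scale_by_two(x):
    i = numbapro.cuda.grid(1)      # this thread's global index in a 1-D grid
    if i < x.shape[0]:             # guard threads that fall past the end of the data
        x[i] = 2.0 * x[i]

vals = np.arange(32, dtype=np.float32)
threads_per_block = 8
blocks_per_grid = (vals.size + threads_per_block - 1) // threads_per_block
scale_by_two[blocks_per_grid, threads_per_block](vals)   # launch: kernel[grid, block](args)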
In [19]:
%qtconsole
In [4]:
import numbapro
from numbapro import *
Let's define a pure Python implementation of the squared Euclidean distance between each row of a and each row of b, stored in c. This is very inefficient.
In [5]:
def dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for n in xrange(N):
        for k in xrange(K):
            dist = 0
            for d in xrange(dim):
                diff = a[n, d] - b[k, d]
                dist += diff ** 2
            c[n, k] = dist
This version makes more use of the NumPy module, which has compiled code.
In [6]:
def np_dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for k in range(K):
        dist = a - b[k]
        dist = dist ** 2
        c[:, k] = dist.sum(axis=1)
In [7]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(p,n)->(m,p)', target='gpu')
def guvect_dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for n in range(N):
        for k in range(K):
            dist = 0
            for d in range(dim):
                diff = a[n, d] - b[k, d]
                #dist += np.power(diff,2)
                dist += diff ** 2
            c[n, k] = dist
This function defines a lower-level implementation of the distance computation. The kernel code is executed by each thread. We want each thread to compute the distance from one point to one cluster, so we'll need $n \times k$ threads to compute the entire distance matrix.
Because each thread corresponds to one entry of that matrix, we have to know which entry it is. Each thread has an index inside the block it belongs to (which can be multidimensional), and each block has an index inside the grid (which can also be multidimensional). The dimensions of the grid and of the blocks don't have to match, so we can have multidimensional blocks inside a one-dimensional grid. This hierarchy is purely logical.
We use the cuda.grid(2) function to retrieve the two-dimensional global index. That index is the one used to index the distance matrix.
In [8]:
@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def cu_dist(a, b, c):
    """
    ## vanilla thread index
    # thread ID inside block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    # block ID
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # block dimensions
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    # compute thread's x and y index (i.e. datapoint and cluster)
    k = tx + bx * bw # column for each cluster
    n = ty + by * bh # row for each datapoint
    """
    """
    ## long vertical matrix thread index
    # thread ID inside block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    # block ID
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # block dimensions
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    # grid dimensions
    gw = cuda.gridDim.x
    gh = cuda.gridDim.y
    # compute thread's x and y index (i.e. datapoint and cluster)
    k = tx # block width is the same as matrix width
    # the second column of blocks means we want to add
    # 2**16 to the index
    n = ty + by * bh + bx * gh * bh
    """
    k, n = numbapro.cuda.grid(2)
    ch, cw = c.shape  # c height and width
    if n >= ch or k >= cw:
        return
    dist = 0.0
    for d in range(a.shape[1]):
        diff = a[n, d] - b[k, d]
        dist += diff ** 2
    c[n, k] = dist
    #cuda.syncthreads()

def cu_dist_wrap(a, b, c, blockDim, gridDim):
    #stream = cuda.stream()
    dA = cuda.to_device(a)
    dB = cuda.to_device(b)
    #dC = cuda.to_device(c)
    #dC = numbapro.cuda.device_array(c.shape)
    dC = numbapro.cuda.device_array_like(c)
    cu_dist[gridDim, blockDim](dA, dB, dC)
    numbapro.cuda.synchronize()
    #cu_dist[gridDim,blockDim,stream](dA,dB,dC)
    #dC.to_host()
    #stream.synchronize() # block until stream is done
    #result = dC.copy_to_host()
    dC.copy_to_host(ary=c)
    #print 'type of result=', type(result)
    #c = result
    #return result
The maximum number of blocks in each grid dimension is 65535 ($2^{16}-1$). For a big matrix (one that needs more blocks along a dimension than this limit), the blocks have to be spread over both grid dimensions, which is what the grid-size computation below does.
In [14]:
MAX_GRID_XYZ_DIM = 65535
MAX_BLOCK_XY_DIM = 1024
MAX_BLOCK_Z_DIM = 64
#%debug
##generate data
n = 1e4
d = 2
k = 20
total_bytes = (n * d + k * d + n * k) * 4
print 'Memory used by arrays:\t',total_bytes/1024,'\tKBytes'
print '\t\t\t',total_bytes/(1024*1024),'\tMBytes'
print 'Memory used by data: \t',n * d * 4 / 1024,'\t','KBytes'
data = np.random.random((n,d)).astype(np.float32)
clusters = np.random.random((k,d)).astype(np.float32)
#dists = np.empty((n,k)).astype(np.float32)
distsDim = np.int(n),np.int(k)
blockDim = 20,25 # GT520M has 48 CUDA cores
tpg = np.prod(blockDim)
# blocks per grid
bpg = np.ceil( np.float(n * k) / tpg )
gridWidth = np.ceil(bpg / MAX_GRID_XYZ_DIM)
gridHeight = np.ceil(bpg / gridWidth)
gridDim = np.int(gridWidth),np.int(gridHeight)
total_threads = np.prod(gridDim) * np.prod(blockDim)
needed_threads = n*k
print 'Block dimension: \t',blockDim
print 'Grid dimension: \t',gridDim
print 'Threads per block:\t',tpg
print 'Blocks per grid: \t',bpg
print 'Total threads: \t',total_threads
print 'Excess threads: \t', total_threads - needed_threads
if gridDim[0] > MAX_GRID_XYZ_DIM or gridDim[1] > MAX_GRID_XYZ_DIM:
    print "\n* * * * * * * * * * * * * * * * * * * * * * * * * * *\n"
    print "WARNING: Grid dimension too big. Maximum dimension = ", MAX_GRID_XYZ_DIM
    print "\n* * * * * * * * * * * * * * * * * * * * * * * * * * *"
In [50]:
my_dev = cuda.get_current_device()
cuda.api.driver.Device.reset(my_dev)
Out[50]:
In [30]:
%qtconsole
In [43]:
#%debug
#distsPython = np.empty(distsDim, dtype=np.float32)
distsNumPy = np.empty(distsDim, dtype=np.float32)
distsCUDA_auto = np.ones(distsDim, dtype=np.float32)*(-1)
distsCUDA_manual = np.ones(distsDim, dtype=np.float32)*(-1)
distsCUDA_gu = np.empty(distsDim, dtype=np.float32)
#dist(data,clusters,distsPython)
np_dist(data,clusters,distsNumPy)
guvect_dist.max_blocksize = 32 # limits to 32 threads per block
#distsCUDA_gu=guvect_dist(data,clusters)
cu_dist_wrap(data,clusters,distsCUDA_manual,blockDim,gridDim)
cu_dist[gridDim,blockDim](data,clusters,distsCUDA_auto) # direct launch with host arrays ("auto" memory management)
#print 'Python = NumPy\t\t',np.allclose(distsPython,distsNumPy)
print 'NumPy = CUDA manual\t',np.allclose(distsNumPy,distsCUDA_manual)
print 'NumPy = CUDA auto\t',np.allclose(distsNumPy,distsCUDA_auto)
#print 'CUDA auto = manual\t',np.allclose(distsCUDA_auto,distsCUDA_manual)
print 'NumPy = CUDA gu\t',np.allclose(distsNumPy,distsCUDA_gu)
if 'distsPython' in locals(): del distsPython
if 'distsNumPy' in locals(): del distsNumPy
if 'distsCUDA_auto' in locals(): del distsCUDA_auto
if 'distsCUDA_manual' in locals(): del distsCUDA_manual
if 'distsCUDA_gu' in locals(): del distsCUDA_gu
In [29]:
dists = np.empty(distsDim).astype(np.float32)
print 'Pure python:\t\t',
%timeit -r 1 dist(data,clusters,dists)
print 'NumPy optimized:\t',
%timeit np_dist(data,clusters,dists)
#print 'GUVectorize:\t\t',
#%timeit -r 1 guvect_dist(data,clusters,dists)
print 'CUDA auto:\t\t',
%timeit cu_dist[gridDim,blockDim](data,clusters,dists)
print 'CUDA manual:\t\t',
%timeit cu_dist_wrap(data,clusters,dists,blockDim,gridDim)
del dists
Pure python: 1 loops, best of 1: 32min 51s per loop
NumPy optimized: 1 loops, best of 3: 6.93 s per loop
CUDA auto: 1 loops, best of 3: 1.31 s per loop
CUDA manual: 1 loops, best of 3: 1.13 s per loop
In [31]:
speedup=34*60 / 1.13
print speedup
In [32]:
del data,clusters
In [2]:
import numbapro
from numbapro import *
In [15]:
from timeit import default_timer as timer
bpg = 150
tpb = 6
n = bpg * tpb
shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb
# Prepare data on the CPU
#A = np.array(np.random.random((n, n)), dtype=np.float32)
#B = np.array(np.random.random((n, n)), dtype=np.float32)
A = np.ones((n, n), dtype=np.float32)
B = np.ones((n, n), dtype=np.float32)
print "N = %d x %d" % (n, n)
In [3]:
@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def naive_matrix_mult(A, B, C):
    x, y = cuda.grid(2)
    if x >= C.shape[1] or y >= C.shape[0]:
        return
    C[y, x] = 0
    for i in range(A.shape[1]):
        C[y, x] += A[y, i] * B[i, x]

def py_naive_matrix_mult(A, B, C):
    for y in xrange(C.shape[0]):
        for x in xrange(C.shape[1]):
            C[y, x] = 0
            for i in xrange(A.shape[1]):
                C[y, x] += A[y, i] * B[i, x]
In [17]:
# Prepare data on the GPU
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
# Time the unoptimized version
s = timer()
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numbapro.cuda.synchronize()
e = timer()
unopt_ans = dC.copy_to_host()
tcuda_unopt = e - s
# Time the Python version
py_ans = np.empty(unopt_ans.shape)
s = timer()
#py_naive_matrix_mult(A, B, py_ans)
py_ans=A.dot(B)
numbapro.cuda.synchronize()  # not strictly needed for the CPU dot, kept for symmetry
e = timer()
t_py = e - s
print 'Times'
print 'CUDA \t',tcuda_unopt
print 'Python\t',t_py
print ''
print 'Are the same:\t',np.allclose(unopt_ans,py_ans)
In [5]:
bpg = 150
tpb = 6
n = bpg * tpb
shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb
A = np.ones((n, n), dtype=np.float32)
B = np.ones((n, n), dtype=np.float32)
print "N = %d x %d" % (n, n)
print "Memory:\t",A.nbytes*3 / 1024, '\t','KBytes'
In [6]:
# First operation
res1 = A.dot(B)
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numbapro.cuda.synchronize()
cuda_res1 = dC.copy_to_host()
B += 1
# Second operation
res2 = A.dot(B)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numbapro.cuda.synchronize()
cuda_res2 = dC.copy_to_host()
# check results
print 'Result 1 check:','\t',np.allclose(res1,cuda_res1)
print 'Result 2 check:','\t',np.allclose(res2,cuda_res2)